P-586

SUMMARY

Results of Johnson and Karlin, P-328, are obtained in a
different way and extended. The methods used are applicable
to more general processes.

A PROBLEM IN THE SEQUENTIAL DESIGN OF EXPERIMENTS

Richard Bellman

§1. Introduction

In two little-known papers written in 1933 and 1939, [14], [15],*
W. Thompson proposed the problem of determining on the basis of
sequential analysis which of two drugs was superior. The problem is
a difficult one, and Thompson concentrated his efforts on the
computation of the effects of a plausible policy, and on a Monte
Carlo determination of the outcome.

A problem in the same general area was discussed by
Mahalanobis, [11], [12], in connection with a sampling survey of
the acreage under jute in Bengal.

An interesting exposition of the general problem is given by
Robbins, [13], where further references may be found. The
connection with the Wald theory of sequential analysis is discussed,
and further problems in this field are presented.

The problem is also in the general field of "learning
processes", where we must determine the structure of a process while
carrying on an experiment, cf. [8], [10], [13].


* We confess that we found these papers in the standard fashion,
namely while thumbing through a journal containing another paper
of interest.


In a recent paper, Johnson and Karlin considered a
particular version of the Thompson problem, essentially the case
where one drug has known properties and the other unknown, and
derived a number of interesting results concerning the structure
of an optimal policy.

In this paper, we shall consider their problem and an analogous
problem by means of a discussion of the functional equation derived
from the process. Using techniques we have employed in various
parts of the theory of dynamic programming, [1], [2], [*], [7],
we shall determine the structure of the optimal policy and complete
the Johnson-Karlin results in an essential detail.

In §2 we present a precise formulation of the problem we
treat here. In §3 we derive the basic functional equation, with
properties of the solution, existence, uniqueness and successive
approximations discussed in §4. The next section contains a
statement of the results we obtain concerning the structure of
the optimal policy. In §6 we present a proof of these results.
Finally in §7 we discuss the numerical computation of the solution
based upon successive approximations.

The methods we employ are applicable to more general processes,
as we shall show in further papers.

§2. Formulation of the Problem

Let us assume that we have two machines, unimaginatively
called I and II, with the following properties. If machine I is
used, there is a probability r of receiving a gain of one unit,
and a probability 1-r of receiving nothing. If machine II is
used, there is a corresponding probability of s.

Unfortunately these probabilities are not known. We do,
however, possess an a priori probability distribution for their
values, F(r, s).

We may now consider either a finite sequence of choices where
we have n trials, or an unbounded process with a discount factor a
for the value of a unit received one trial away. The infinite
process is simpler analytically since it possesses an invariant
aspect over time. Our methods are equally applicable to both
types of processes.

The problem is now to determine the sequence of choices which
maximizes the total expected return. This sequence is in general
stochastic since the choice after any finite number of choices will
depend upon the outcomes of the preceding choices.

In this paper, we shall consider only the simple case where
r and s are uncorrelated, and even further we shall assume that
s is known. Let F(r) be the distribution function for r in
[0,1].

§3. The Basic Functional Equation

We shall utilize an analytic approach based upon a functional
equation associated with the problem. Let us define

(1)  $f_{m,n}(s)$ = expected return obtained using an optimal policy
     for an unbounded process after the first machine
     has had m successes and n failures.


Our fundamental assumption is the usual one that the new
a priori distribution function after m successes and n failures
on the first (unknown) machine is given by

\[
dF_{m,n}(r) = \frac{r^m (1-r)^n \, dF(r)}{\int_0^1 r^m (1-r)^n \, dF(r)}. \tag{2}
\]
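When the prior is concentrated on finitely many values of r, the
integrals in (2) reduce to sums, and the revised distribution can be
computed directly. The following is a minimal sketch under that
assumption; the function names and the two-point prior are hypothetical
and serve only as illustration.

```python
def posterior(points, weights, m, n):
    """Revised prior weights after m successes and n failures, following (2)."""
    # Unnormalised mass r^m (1-r)^n dF(r) at each support point of the prior.
    raw = [w * r**m * (1 - r)**n for r, w in zip(points, weights)]
    total = sum(raw)  # discrete analogue of the normalising integral in (2)
    return [x / total for x in raw]


def success_probability(points, weights, m, n):
    """Mean of r under the revised prior: the chance that the next
    trial on machine I yields a gain."""
    post = posterior(points, weights, m, n)
    return sum(r * p for r, p in zip(points, post))


# Hypothetical prior: r equals 0.2 or 0.8 with equal weight.
points, weights = [0.2, 0.8], [0.5, 0.5]
print(posterior(points, weights, 1, 0))            # weights after one success
print(success_probability(points, weights, 1, 0))  # revised chance of a gain
```

A single success moves weight toward the larger value of r, and a
success followed by a failure restores the original weights for this
symmetric prior.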


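On the basis of this assumption, an enumeration of outcomes
yields the relation, if the first machine is chosen,

\[
f_{m,n}(s) = \int_0^1 r \, dF_{m,n}(r) \, [\, 1 + a f_{m+1,n}(s) \,]
           + \int_0^1 (1-r) \, dF_{m,n}(r) \, [\, a f_{m,n+1}(s) \,]. \tag{3}
\]

On the other hand, if the second (known) machine is chosen,
we have, since this choice leaves m and n unchanged and is
therefore repeated at every later trial,

\[
f_{m,n}(s) = \frac{s}{1-a}. \tag{4}
\]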

Hence we obtain the fundamental recurrence relation

\[
f_{m,n}(s) = \max \left[ \begin{array}{ll}
\text{I:} & \int_0^1 r \, dF_{m,n}(r) \, [\, 1 + a f_{m+1,n}(s) \,]
          + \int_0^1 (1-r) \, dF_{m,n}(r) \, [\, a f_{m,n+1}(s) \,] \\
\text{II:} & s/(1-a)
\end{array} \right], \tag{5}
\]

a typical functional equation in the theory of dynamic programming.

Let us now introduce some simplifying notation. Write

\[
f_{m,n}(s) = f(m,n), \qquad \int_0^1 r \, dF_{m,n}(r) = b(m,n). \tag{6}
\]

Then (5) takes the simpler form

\[
f(m,n) = \max \left[ \begin{array}{ll}
\text{I:} & b(m,n) \, [\, 1 + a f(m+1,n) \,] + a \, (1 - b(m,n)) \, f(m,n+1) \\
\text{II:} & s/(1-a)
\end{array} \right] \tag{7}
\]

for $m, n \ge 0$.

Let us note that 0 < a, s < 1, and that 0 < b(m,n) < 1 for
$m, n \ge 0$.

§4. Existence and Uniqueness of Solution

Since our analysis of the structure of the optimal policy will
be based upon a continued application of successive approximations
to the system in (3.7), it is essential to have an existence and
uniqueness theorem and information concerning the convergence of
successive approximations.

The method we employ is equally applicable to other functional
equations in dynamic programming and examples may be found in [1],
[2j.  [*].  [5j.  W. 

Theorem 1. There is a unique solution to (3.7), which
is uniformly bounded by 1/(1-a) for all m and $n \ge 0$. This solution
may be obtained as the limit as $k \to \infty$ of the sequence $\{f_k\}$
defined recurrently by

\[
\begin{aligned}
f_0(m,n) &= g(m,n), \\
f_{k+1}(m,n) &= T_{mn}(f_k), \qquad k \ge 0,
\end{aligned} \tag{1}
\]

where we set

\[
T_{mn}(f) = \max \left[ \begin{array}{ll}
\text{I:} & b(m,n) \, (1 + a f(m+1,n)) + (1 - b(m,n)) \, [\, a f(m,n+1) \,] \\
\text{II:} & s/(1-a)
\end{array} \right]. \tag{2}
\]

Here $\{g(m,n)\}$ may be any sequence uniformly bounded by 1/(1-a).

Proof: Let us define

\[
\begin{aligned}
u_{I}(f,m,n) &= b(m,n) \, [\, 1 + a f(m+1,n) \,] + a \, [\, 1 - b(m,n) \,] \, f(m,n+1), \\
u_{II}(f,m,n) &= s/(1-a).
\end{aligned} \tag{3}
\]

Then for each $m, n \ge 0$, we have

\[
f_{k+1}(m,n) = u_A(f_k, m, n), \tag{4}
\]

where A = I or II and the choice is dependent upon m, n and $f_k$.

Similarly

\[
f_k(m,n) = u_B(f_{k-1}, m, n), \tag{5}
\]

where B may equal A.

In any case, we have, by virtue of the recurrence relation
of (1), the inequalities

\[
\begin{aligned}
f_{k+1}(m,n) &= u_A(f_k,m,n) \ge u_B(f_k,m,n), \\
f_k(m,n) &= u_B(f_{k-1},m,n) \ge u_A(f_{k-1},m,n).
\end{aligned} \tag{6}
\]

Hence

\[
\begin{aligned}
f_{k+1}(m,n) - f_k(m,n) &\le u_A(f_k,m,n) - u_A(f_{k-1},m,n), \\
f_{k+1}(m,n) - f_k(m,n) &\ge u_B(f_k,m,n) - u_B(f_{k-1},m,n).
\end{aligned} \tag{7}
\]


These inequalities yield

\[
|f_{k+1}(m,n) - f_k(m,n)| \le \max \left[ \,
|u_{I}(f_k,m,n) - u_{I}(f_{k-1},m,n)|, \;
|u_{II}(f_k,m,n) - u_{II}(f_{k-1},m,n)| \, \right], \tag{8}
\]

or, using the analytic expression for $u_{I}$ and $u_{II}$,

\[
\begin{aligned}
|f_{k+1}(m,n) - f_k(m,n)|
&\le a \, b(m,n) \, |f_k(m+1,n) - f_{k-1}(m+1,n)| \\
&\quad + a \, (1 - b(m,n)) \, |f_k(m,n+1) - f_{k-1}(m,n+1)| \\
&\le a \max \left[ \, |f_k(m+1,n) - f_{k-1}(m+1,n)|, \;
|f_k(m,n+1) - f_{k-1}(m,n+1)| \, \right].
\end{aligned} \tag{9}
\]

If we set

\[
u_k = \sup_{m,n \ge 0} |f_k(m,n) - f_{k-1}(m,n)|, \tag{10}
\]

the inequality in (9) yields

\[
u_{k+1} \le a \, u_k. \tag{11}
\]

From this it follows that the series

\[
S(m,n) = \sum_{k=0}^{\infty} \left( f_{k+1}(m,n) - f_k(m,n) \right) \tag{12}
\]

converges uniformly in m and n for $m, n \ge 0$, and that
$f_k(m,n) \to f(m,n)$ as $k \to \infty$.


It is readily verified that $\{f(m,n)\}$ is uniformly bounded
by 1/(1-a) for $0 \le s \le 1$ if this holds for $\{g(m,n)\}$.

To establish uniqueness, let $\{F(m,n)\}$ be another solution,
uniformly bounded by 1/(1-a), or any fixed quantity.

Proceeding as in (4) - (9), we obtain the inequality

\[
|f(m,n) - F(m,n)| \le a \max \left[ \, |f(m+1,n) - F(m+1,n)|, \;
|f(m,n+1) - F(m,n+1)| \, \right]. \tag{13}
\]

Setting

\[
u = \sup_{m,n \ge 0} |f(m,n) - F(m,n)|, \tag{14}
\]

the inequality in (13) yields

\[
u \le a \, u, \tag{15}
\]

and consequently the result that u = 0 or f(m,n) = F(m,n).
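The contraction estimate (11) can be observed numerically. The
following is a minimal sketch, not a definitive implementation. It
assumes a uniform prior on [0,1], for which b(m,n) = (m+1)/(m+n+2),
and it truncates the system to the states with $m + n \le N$, holding
the states with m + n = N fixed. The truncation, the starting sequence
and all function names are assumptions introduced for illustration.

```python
def b(m, n):
    # Mean of r after m successes and n failures for a uniform prior
    # on [0, 1] (an assumption of this sketch).
    return (m + 1) / (m + n + 2)


def apply_T(f, s, a, N):
    """Apply the transformation T once on the states with m + n < N.

    States with m + n = N are left unchanged; this truncation is an
    assumption made only to keep the computation finite.
    """
    new = dict(f)
    for m in range(N):
        for n in range(N - m):
            choose_I = (b(m, n) * (1 + a * f[m + 1, n])
                        + a * (1 - b(m, n)) * f[m, n + 1])
            choose_II = s / (1 - a)
            new[m, n] = max(choose_I, choose_II)
    return new


def successive_approximations(s, a, N, iterations):
    """Return the last iterate and the successive sup-norm differences."""
    # Starting sequence g(m, n): commit forever to the machine that
    # currently looks better. It is bounded by 1/(1-a).
    f = {(m, n): max(b(m, n), s) / (1 - a)
         for m in range(N + 1) for n in range(N + 1 - m)}
    gaps = []
    for _ in range(iterations):
        new = apply_T(f, s, a, N)
        gaps.append(max(abs(new[key] - f[key]) for key in f))
        f = new
    return f, gaps


f, gaps = successive_approximations(s=0.6, a=0.9, N=30, iterations=20)
# Each difference is at most a times the preceding one, as in (11).
print(all(g1 <= 0.9 * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:])))
```

On the truncated system the iteration terminates exactly after at most
N steps, since every state is at most N trials away from the fixed
boundary; the geometric bound is what carries over to the unbounded
process.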
§5. Statement of Results

Let us now state the results we shall prove concerning the
structure of the solution of (3.7). Observe that it is the "policy",
i.e. the value of s which dictates a choice of machine I or machine II,
which determines the solution.

Theorem 2. For each $m, n \ge 0$, there is a unique quantity s(m,n) with
the property that

\[
\begin{aligned}
\text{(a)} \quad f(m,n) &= \frac{s}{1-a}, & 1 \ge s \ge s(m,n), \\
\text{(b)} \quad f(m,n) &= b(m,n) \, [\, 1 + a f(m+1,n) \,] + a \, (1 - b(m,n)) \, f(m,n+1),
& 0 \le s \le s(m,n).
\end{aligned} \tag{1}
\]

The sequence $\{s(m,n)\}$ has the following properties

\[
s(m+1,n) > s(m,n) > s(m,n+1), \tag{2}
\]

and the sequence $\{f(m,n)\}$ similarly satisfies the relations

\[
f(m+1,n) \ge f(m,n) \ge f(m,n+1). \tag{3}
\]

Analogous results hold for the case of a finite number of
trials.

The proof which we shall present in the next section will be
based on the method of successive approximations.
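The critical values of Theorem 2 can be approximated numerically by
locating the value of s at which the two alternatives in (3.7) are
equal. The following is a minimal sketch under stated assumptions: a
uniform prior, so that b(m,n) = (m+1)/(m+n+2); a truncation at
m + n = N, where the return is taken to be that of committing forever
to the machine that then looks better; and bisection in s. None of
these choices, nor the function names, come from the text above.

```python
def b(m, n):
    # Mean of r after m successes and n failures for a uniform prior
    # on [0, 1] (an assumption of this sketch).
    return (m + 1) / (m + n + 2)


def value_of_I(m, n, s, a, N):
    """Return from choosing machine I at (m, n), requiring m + n < N.

    Computed by backward recursion from the truncation level m + n = N.
    """
    # Boundary rule (assumed): commit forever to the better-looking machine.
    f = {(i, N - i): max(b(i, N - i), s) / (1 - a) for i in range(N + 1)}
    for level in range(N - 1, m + n, -1):
        for i in range(level + 1):
            j = level - i
            u_I = b(i, j) * (1 + a * f[i + 1, j]) + a * (1 - b(i, j)) * f[i, j + 1]
            f[i, j] = max(u_I, s / (1 - a))
    return b(m, n) * (1 + a * f[m + 1, n]) + a * (1 - b(m, n)) * f[m, n + 1]


def threshold(m, n, a, N, tol=1e-8):
    """Bisection for the value of s at which machines I and II tie."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value_of_I(m, n, mid, a, N) > mid / (1 - a):
            lo = mid  # machine I still preferred, so the tie lies above mid
        else:
            hi = mid
    return (lo + hi) / 2


a, N = 0.9, 30
t_up, t_mid, t_down = threshold(1, 0, a, N), threshold(0, 0, a, N), threshold(0, 1, a, N)
print(t_up > t_mid > t_down)  # the ordering asserted in (2)
```

Bisection is justified here because the return from machine I grows
with s at a rate of at most a/(1-a), while s/(1-a) grows at the rate
1/(1-a), so the difference of the two changes sign exactly once.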

§6. Proof of Theorem 2

We shall approximate to the solution of the original equation
by means of the sequence $\{f_k(m,n)\}$ defined as follows

\[
\begin{aligned}
f_0(m,n) &= \max \, [\, b(m,n), \; s/(1-a) \,], \\
f_{k+1}(m,n) &= T_{mn}(f_k), \qquad k = 0, 1, 2, \ldots, \quad m, n \ge 0.
\end{aligned} \tag{1}
\]

We wish to prove the following statements

(2)  (a)

     (b)  for all $k \ge 0$, there is a sequence $\{s_k(m,n)\}$ with the
          property that

          (1)  for $s \ge s_k(m,n)$,  $f_{k+1}(m,n) = s/(1-a)$,

          (2)  for $0 \le s \le s_k(m,n)$,  $f_{k+1}(m,n) = u_{I}(f_k,m,n)$.

     (c)  $f_k(m+1,n) \ge f_k(m,n) \ge f_k(m,n+1)$,

     (d)  $s_k(m+1,n) > s_k(m,n) > s_k(m,n+1)$,

     (e)  $s_{k+1}(m,n) > s_k(m,n)$.

Let us begin with the case k = 0. Let $\{s_0(m,n)\}$ be the
sequence determined by the equation

\[
s/(1-a) = b(m,n). \tag{3}
\]

Then (2b) is clearly true for k = 0. To obtain further relations
we require the inequalities

\[
b(m,n+1) < b(m,n) < b(m+1,n) \tag{4}
\]

for all $m, n \ge 0$.

The second inequality is equivalent to

\[
\frac{\int_0^1 r^{m+2} (1-r)^n \, dF}{\int_0^1 r^{m+1} (1-r)^n \, dF}
> \frac{\int_0^1 r^{m+1} (1-r)^n \, dF}{\int_0^1 r^{m} (1-r)^n \, dF} \tag{5}
\]

or

\[
\left( \int_0^1 r \, dG \right)^2 < \left( \int_0^1 dG \right) \left( \int_0^1 r^2 \, dG \right) \tag{6}
\]

where $dG = r^m (1-r)^n \, dF$. This, however, is the Cauchy-Schwarz
inequality.

It may readily be verified that the first inequality is also
equivalent to (6).
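The verification for the first inequality of (4) can be written out in
a few lines. With dG as in (6), we have
$b(m,n) = \int_0^1 r \, dG / \int_0^1 dG$ and
$b(m,n+1) = \int_0^1 r(1-r) \, dG / \int_0^1 (1-r) \, dG$. The
following sketch of one possible route assumes that all the integrals
involved are positive.

```latex
\begin{aligned}
b(m,n+1) < b(m,n)
&\iff \int_0^1 r(1-r)\,dG \int_0^1 dG
      < \int_0^1 r\,dG \int_0^1 (1-r)\,dG \\
&\iff \int_0^1 r\,dG \int_0^1 dG - \int_0^1 r^2\,dG \int_0^1 dG
      < \int_0^1 r\,dG \int_0^1 dG - \Bigl(\int_0^1 r\,dG\Bigr)^2 \\
&\iff \Bigl(\int_0^1 r\,dG\Bigr)^2 < \int_0^1 dG \int_0^1 r^2\,dG .
\end{aligned}
```

The last line is (6). The inequality is strict unless G is concentrated
at a single value of r.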

These inequalities, (4), yield (2c) and (2d) for k = 0.

Let us now begin the induction. We have

\[
f_1(m,n) = T_{mn}(f_0). \tag{7}
\]

Since $s_1(m,n)$ is determined by the equality of the two
expressions in $T_{mn}(f_0)$, it is clear that

\[
s_1(m,n) > s_0(m,n). \tag{8}
\]

Let us now demonstrate the essential result that

\[
s_1(m+1,n) > s_1(m,n) > s_1(m,n+1). \tag{9}
\]

Consider the equation for $s_1(m,n)$. We have, with $s = s_1(m,n)$,

\[
\frac{s}{1-a} = b(m,n) + a \, b(m,n) \, f_0(m+1,n) + a \, (1 - b(m,n)) \, f_0(m,n+1). \tag{10}
\]

However, since

\[
s_1(m,n) > s_0(m,n) > s_0(m,n+1), \tag{11}
\]

we have for this value of s

\[
f_0(m,n+1) = \frac{s}{1-a}. \tag{12}
\]

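Hence (10) reduces to

\[
\frac{s}{1-a} = b(m,n) + a \, b(m,n) \, f_0(m+1,n) + a \, (1 - b(m,n)) \, \frac{s}{1-a}, \tag{13}
\]

or

\[
s = \frac{(1-a) \, b(m,n) \, [\, 1 + a f_0(m+1,n) \,]}{(1-a) + a \, b(m,n)}. \tag{14}
\]

Since b(m+1,n) > b(m,n), we see that

\[
\frac{(1-a) \, b(m+1,n)}{(1-a) + a \, b(m+1,n)} > \frac{(1-a) \, b(m,n)}{(1-a) + a \, b(m,n)}. \tag{15}
\]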

Since $f_0(m+1,n)$ is monotone increasing in m for all n and s, it
follows that the curve

\[
u_{m+1,n}(s) = \frac{(1-a) \, b(m+1,n) \, [\, 1 + a f_0(m+2,n) \,]}{(1-a) + a \, b(m+1,n)} \tag{16}
\]

lies above the curve for $u_{m,n}(s)$. Hence $s_1(m+1,n) > s_1(m,n)$. The
same proof shows that $s_1(m,n+1) < s_1(m,n)$.

The last step of the induction consists of the proof that

\[
f_1(m+1,n) \ge f_1(m,n) \ge f_1(m,n+1). \tag{17}
\]

Consider the proof of the first of these inequalities. We
have

\[
f_1(m+1,n) = \max \left[ \begin{array}{l}
b(m+1,n) + a \, b(m+1,n) \, f_0(m+2,n) + a \, (1 - b(m+1,n)) \, f_0(m+1,n+1) \\
s/(1-a)
\end{array} \right]. \tag{18}
\]

Let us set

\[
\begin{aligned}
a_1 &= f_0(m+1,n), & a_2 &= f_0(m,n+1), \\
b_1 &= f_0(m+2,n), & b_2 &= f_0(m+1,n+1), \\
\lambda &= b(m,n), & \mu &= b(m+1,n).
\end{aligned} \tag{19}
\]

The location of $a_2, b_2, a_1, b_1$ on the real axis is as follows:

[Figure: the points 0, $a_2$, $b_2$, $a_1$, $b_1$ marked in that order on the real axis]

by virtue of the inequalities holding for $f_0(m,n)$. Since $\mu > \lambda$,
it is sufficient to show that the convex combination $\mu b_1 + (1-\mu) b_2$
is greater than or equal to the convex combination $\lambda a_1 + (1-\lambda) a_2$.

Consider the linear expression $E(\lambda) = \lambda a_1 + (1-\lambda) a_2$ for
$0 \le \lambda \le \mu$. At $\lambda = \mu$, we have

\[
E(\mu) = \mu a_1 + (1-\mu) a_2 \le \mu b_1 + (1-\mu) b_2. \tag{20}
\]

At $\lambda = 0$, we have

\[
E(0) = a_2 \le \mu b_1 + (1-\mu) b_2 \tag{21}
\]

for any $\mu \ge 0$.

Consequently, for all values of $\lambda$ in the interval $[0, \mu]$ we
have

\[
E(\lambda) \le \mu b_1 + (1-\mu) b_2. \tag{22}
\]

Comparing the expression for $f_1(m+1,n)$ with that for $f_1(m,n)$,
it follows that $f_1(m+1,n) \ge f_1(m,n)$. The inequality
$f_1(m,n) \ge f_1(m,n+1)$ is derived similarly.

We now have all the details required for an inductive proof
which proceeds from k to k+1 in precisely the fashion above.

Since the inequalities are valid for all k, they are valid
for the limiting sequence $\{f(m,n)\}$, with strict inequality because
of the strict inequality in the relation b(m+1,n) > b(m,n).

§7. Discussion

It seems to be a very difficult problem to determine the
precise analytic form of s(m,n). Consequently the most efficient
method of determining this sequence is probably by means of
successive approximations starting with a suitable $f_0(m,n)$.

It is worth pointing out that in place of starting out with
an initial approximation, $\{f_0(m,n)\}$, it is better to guess an
initial policy, $\{s_0(m,n)\}$. It is simpler, and more natural, to choose
a sequence of values rather than a sequence of functions.
Furthermore, we have a much stronger feel for an approximate policy than
we do for an approximate function, cf. [1], [6], where this
idea is directed to other applications.

It is quite surprising that it is so difficult to prove the
intuitively obvious relations $f(m+1,n) \ge f(m,n) \ge f(m,n+1)$. There
should be another formulation which makes this result obvious at a
glance.